Bioinformatics Advances — Latest Matching Preprints

1

STELLA: Self-Evolving LLM Agent for Biomedical Research

Jin, R.; ZHANG, Z.; Wang, M.; Cong, L.

2025-07-05 bioinformatics 10.1101/2025.07.01.662467 medRxiv

Top 0.1%

38.5%

The staggering complexity of modern biomedical research has intensified the aspiration for a generalist "Biomedical World Model", yet current AI agents remain constrained by static capabilities and a lack of self-evolution mechanisms. To bridge this gap, we present STELLA, a self-evolving multimodal agent designed to progressively refine its computational reasoning and physical execution through interaction. STELLA operates via a collaborative multi-agent framework (comprising Manager, Developer, Critic, and Tool Creation agents) that continuously updates reasoning templates and autonomously expands a dynamic "Tool Ocean". We demonstrate STELLA's capabilities on the created Tool Creation Benchmark, where it attains a score of 4.01/5 with 100% task completion, significantly outperforming state-of-the-art models including GPT-5, Claude 4 Opus, and Biomni. Beyond computational metrics, STELLA drives experimentally validated scientific discovery. In oncology, the agent identified Butyrophilin Subfamily 3 Member A1 (BTN3A1) as a novel negative regulator of natural killer (NK) cell function in acute myeloid leukemia (AML), verified via CRISPR knockout studies. In protein engineering, STELLA orchestrated a complete directed evolution workflow for the enzyme strictosidine synthase, identifying variants, notably M276L, exhibiting more than a two-fold improvement in catalytic activity. Finally, the system extends to physical laboratory automation by training Vision-Language-Action (VLA) models through a Decompose-Monitor-Recover mechanism, which increased success rates from 17% to 82%. By integrating autonomous tool evolution, biological discovery, and robotic control, STELLA offers a blueprint for a self-evolving world model in the life sciences.

2

BRIDGE: Biological Antimicrobial Resistance Inference via Domain-Knowledge Graph Embeddings

Iyer, A.; Kazeem, Y.; Kafaie, S.; Rajabi, E.

2026-02-11 bioinformatics 10.64898/2026.02.09.704676 medRxiv

Top 0.1%

34.1%

Antimicrobial resistance (AMR) is a growing global health crisis, responsible for an estimated 1.27 million deaths in 2019 alone. Traditional approaches to identifying antibiotic resistance genes (ARGs) are often labour-intensive and limited in their ability to detect novel resistance mechanisms. In this study, we propose BRIDGE, a knowledge graph-based framework, to improve AMR gene prediction by integrating gene neighbourhood information and protein-protein interaction networks. Focusing on Klebsiella pneumoniae and Escherichia coli, we construct a comprehensive and biologically grounded knowledge graph using curated data from CARD, STRING, and DrugBank. We apply knowledge graph embedding models which are fed into deep neural networks to infer novel AMR links, achieving classification accuracy of up to 97%. Our results demonstrate that incorporating biologically meaningful relationships, such as gene neighbourhood information and protein interactions, enhances the predictive accuracy and interpretability of AMR link predictions. This work contributes to the development of scalable and data-integrated approaches for advancing antimicrobial resistance surveillance and drug discovery. BRIDGE implementation and data are available at https://github.com/GraphML-lab/BRIDGE.

3

GSEA-InContext Explorer: An interactive visualization tool for putting gene set enrichment analysis results into biological context.

Powers, R. K.; Sun, A.; Costello, J.

2019-06-04 bioinformatics 10.1101/659847 medRxiv

Top 0.1%

30.7%

Summary: GSEA-InContext Explorer is a Shiny app that allows users to perform two methods of gene set enrichment analysis (GSEA). The first, GSEAPreranked, applies the GSEA algorithm in which statistical significance is estimated from a null distribution of enrichment scores generated for randomly permuted gene sets. The second, GSEA-InContext, incorporates a user-defined set of background experiments to define the null distribution and calculate statistical significance. GSEA-InContext Explorer allows the user to build custom background sets from a compendium of over 5,700 curated experiments, run both GSEAPreranked and GSEA-InContext on their own uploaded experiment, and explore the results using an interactive interface. This tool will allow researchers to visualize gene sets that are commonly enriched across experiments and identify gene sets that are uniquely significant in their experiment, thus complementing current methods for interpreting gene set enrichment results. Availability and implementation: The code for GSEA-InContext Explorer is available at: https://github.com/CostelloLab/GSEA-InContext_Explorer and the interactive tool is at: http://gsea-incontext_explorer.ngrok.io
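
To make the idea of a permutation-based null distribution concrete, the following is a minimal sketch, not the tool's implementation. It assumes a simplified, unweighted running-sum enrichment score and a two-sided comparison against scores of randomly drawn gene sets of the same size; the function names (`enrichment_score`, `permutation_p_value`) and the toy gene identifiers are hypothetical. Real GSEA weights hits by the ranking metric and treats positive and negative scores separately.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    hits = set(gene_set) & set(ranked_genes)
    n_hit = len(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hit if gene in hits else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Estimate significance from scores of random gene sets of equal size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = len(set(gene_set) & set(ranked_genes))
    null = [
        enrichment_score(ranked_genes, rng.sample(ranked_genes, size))
        for _ in range(n_perm)
    ]
    extreme = sum(1 for score in null if abs(score) >= abs(observed))
    # Add-one correction so the estimate is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data (hypothetical): 100 ranked genes, gene set concentrated at the top.
ranked = [f"g{i}" for i in range(100)]
top_set = ranked[:10]
es, p_value = permutation_p_value(ranked, top_set)
```

In this framing, GSEA-InContext would differ only in where the null scores come from: instead of random gene sets, the null would be built from the user-selected background experiments.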

4

gffutilsAI: An AI-Agent for Interactive Genomic Feature Exploration in GFF files

Bassi, S.; Gonzalez, C.; Yang, T.

2025-12-05 bioinformatics 10.64898/2025.12.02.690645 medRxiv

Top 0.1%

27.8%

The General Feature Format (GFF) is widely used to represent genomic annotations, but its hierarchical, multi-attribute structure makes manual querying and analysis challenging. Existing libraries such as gffutils provide programmatic interfaces, yet they require coding proficiency. gffutilsAI is a novel AI-powered command-line agent that enables researchers to perform interactive, natural-language-driven exploration of GFF files. Built on top of the gffutils library and the Strands AI agent framework, gffutilsAI integrates local and cloud-based large language models (LLMs) such as Llama 3.1, GPT-5, and Claude 3.5 to translate human queries into executable actions. The tool supports coordinate-based queries, attribute and GO searches, hierarchical traversal, statistical summaries, and CSV export, offering a new paradigm for accessible conversational genomics.
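
As an illustration of what a coordinate-based query and a hierarchical traversal over GFF data involve, here is a minimal pure-Python sketch. It deliberately uses neither gffutils nor gffutilsAI; the helper names (`parse_gff`, `in_region`, `descendants`) and the toy annotation are hypothetical, and the parser assumes well-formed nine-column GFF3 lines with `ID` and `Parent` attributes.

```python
from collections import defaultdict

# Toy GFF3 content (hypothetical): one gene with an mRNA and two exons,
# plus a second, unrelated gene.
GFF_TEXT = """\
chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=abcA
chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1
chr1\tdemo\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1
chr1\tdemo\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1
chr1\tdemo\tgene\t2000\t2500\t.\t-\t.\tID=gene2;Name=xyzB
"""


def parse_gff(text):
    """Parse GFF3 lines into dictionaries with typed coordinates."""
    features = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = line.split("\t")
        attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
        features.append({
            "seqid": seqid, "type": ftype, "start": int(start),
            "end": int(end), "strand": strand, "attributes": attributes,
        })
    return features


def in_region(features, seqid, start, end, featuretype=None):
    """Coordinate-based query: features overlapping seqid:start-end."""
    return [
        f for f in features
        if f["seqid"] == seqid and f["start"] <= end and f["end"] >= start
        and (featuretype is None or f["type"] == featuretype)
    ]


def descendants(features, parent_id):
    """Hierarchical traversal: all features below parent_id via Parent links."""
    children = defaultdict(list)
    for f in features:
        for parent in f["attributes"].get("Parent", "").split(","):
            if parent:
                children[parent].append(f)
    found, stack = [], list(children[parent_id])
    while stack:
        node = stack.pop()
        found.append(node)
        node_id = node["attributes"].get("ID")
        if node_id:
            stack.extend(children[node_id])
    return found


features = parse_gff(GFF_TEXT)
genes_in_window = in_region(features, "chr1", 1, 1000, featuretype="gene")
gene1_parts = descendants(features, "gene1")
```

An agent of the kind described would translate a request such as "which genes lie in the first kilobase of chr1?" into a call like `in_region(...)`; in the actual tool that role is played by the gffutils library rather than a hand-written parser.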

5

RFMix-reader: Accelerated reading and processing for local ancestry studies

Benjamin, K. J. M.

2024-07-17 bioinformatics 10.1101/2024.07.13.603370 medRxiv

Top 0.1%

27.4%

Motivation: Local ancestry inference is a powerful technique in genetics, revealing population history and the genetic basis of diseases. It is particularly valuable for improving eQTL discovery and fine-mapping in admixed populations. Despite the widespread use of the RFMix software for local ancestry inference, large-scale genomic studies face challenges of high memory consumption and processing times when handling RFMix output files. Results: Here, I present RFMix-reader, a new Python-based parsing software, designed to streamline the analysis of large-scale local ancestry datasets. This software prioritizes computational efficiency and memory optimization, leveraging GPUs when available for additional speed boosts. By overcoming these data processing hurdles, RFMix-reader empowers researchers to unlock the full potential of local ancestry data for understanding human health and health disparities. Availability: RFMix-reader is freely available on PyPI at https://pypi.org/project/rfmix-reader/, implemented in Python 3, and supported on Linux, Windows, and Mac OS. Contact: KynonJade.Benjamin@libd.org Supplementary information: Supplementary data are available at https://rfmix-reader.readthedocs.io/en/latest/.

6

Agentomics: An Agentic System that Autonomously Develops Novel State-of-the-art Solutions for Biomedical Machine Learning Tasks

Martinek, V.; Gariboldi, A.; Tzimotoudis, D.; Galea, M.; Zacharopoulou, E.; Alberdi Escudero, A.; Blake, E.; Cechak, D.; Cassar, L.; Balestrucci, A.; Alexiou, P.

2026-01-30 bioinformatics 10.64898/2026.01.27.702049 medRxiv

Top 0.1%

27.3%

Motivation: Extracting knowledge from biomedical data is crucial for advancing our understanding of biological systems and developing novel therapeutics. The quantity, quality, and resolution of biomedical data constantly evolve, requiring the automation of biomedical machine learning (ML). Existing Automated ML tools lack flexibility, while Large Language Models (LLMs) struggle to consistently deliver reproducible machine learning codebases, and existing LLM Agent-powered solutions lag behind human-engineered ML models. Results: Here, we introduce Agentomics, an autonomous LLM-powered agentic system for end-to-end ML experimentation. Given a biomedical dataset, Agentomics implements various ML modeling strategies and produces a ready-to-use ML model. Agentomics introduces strict validation checkpoints for standard ML development steps, allowing gradual development on top of working code with defined interfaces and validated artifacts. Further, it offers native support for biomedical foundation models that can be leveraged during experimentation. The generic nature of Agentomics allows the user to create ML solutions for a large variety of datasets and use various LLMs. We evaluate Agentomics across 20 datasets from the domains of Protein Engineering, Drug Discovery, and Regulatory Genomics. When benchmarked against other agentic systems, Agentomics outperformed them in all tested domains. When benchmarked against human expert solutions, Agentomics generated novel state-of-the-art models for 11/20 established benchmark datasets. Availability and Implementation: Agentomics is implemented in Python. Source code and documentation are freely available at: https://github.com/BioGeMT/Agentomics-ML. Contact: panagiotis.alexiou@um.edu.mt

7

Creating a biomedical knowledge base by addressing GPT inaccurate responses and benchmarking context

Darnell, S. S.; Prins, J. P.; Suh, E.; Huang, P.; Williams, R. W.; Overall, R.; Chen, H.; Garrison, E.; Guarracino, A.; Villani, F.; Muli, P.; Ashbrook, D. G.; Colonna, V.; Batten, C.; Sen, S.; Muriithi, F. M.; Yousefi, S.; Nijveen, H.; Lisso, F.; Isaac, A.; Kabui, A.; Kilungi, M. B.; Kibet, A.; Umar, M.; Muhia, B.

2024-10-18 scientific communication and education 10.1101/2024.10.16.618663 medRxiv

Top 0.1%

27.2%

We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval augmented generation (RAG) with a focus on aging, dementia, Alzheimer's and diabetes. We uploaded a corpus of three thousand peer-reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT hallucinations, we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information we collected evaluations and feedback from both domain expert users and citizen scientists on the relevance of GPT responses. A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. When evaluating the responses to their questions, human respondents give a "thumbs-up" 76% of the time. Meanwhile, RAGAS scores 90% on answer relevance on questions posed by experts, and 74% on answer relevance when GPT generates the questions. With RAGAS we created a benchmark that can be used to continuously assess the performance of our knowledge base. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and human. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.

8

RNA-Protein Interaction Classification via Sequence Embeddings

Matus, D.; Runge, F.; Franke, J. K. H.; Gerne, L.; Uhl, M.; Backofen, R.; Hutter, F.

2024-11-11 bioinformatics 10.1101/2024.11.08.622607 medRxiv

Top 0.1%

27.2%

RNA-protein interactions (RPI) are ubiquitous in cellular organisms and essential for gene regulation. In particular, protein interactions with non-coding RNAs (ncRNAs) play a critical role in these processes. Experimental analysis of RPIs is time-consuming and expensive, and existing computational methods rely on small and limited datasets. This work introduces RNAInterAct, a comprehensive RPI dataset, alongside RPIembeddor, a novel transformer-based model designed for classifying ncRNA-protein interactions. By leveraging two foundation models for sequence embedding, we incorporate essential structural and functional insights into our task. We demonstrate RPIembeddor's strong performance and generalization capability compared to state-of-the-art methods across different datasets and analyze the impact of the proposed embedding strategy on the performance in an ablation study.

9

BRAVE: a highly accurate method for predicting HIV-1 antibody resistance using large language models for proteins

El Anbari, M.; Bylund, T.; O'Dell, S.; Tourtellott, E.; McKee, K.; Schmidt, S. D.; Mkhize, N. N.; Moore, P. L.; Doria-Rose, N.; Zhou, T.; Rawi, R.

2025-07-31 bioinformatics 10.1101/2025.07.28.667234 medRxiv

Top 0.1%

26.9%

Motivation: Broadly neutralizing antibodies (bNAbs) that target the envelope glycoprotein (Env) of human immunodeficiency virus-1 (HIV-1) have been utilized in clinical trials aimed at preventing and treating HIV-1 infections. However, the emergence of neutralization resistance to bNAbs occurs rapidly due to the high mutation rate of HIV-1. Previous studies have suggested the use of in silico methods to effectively predict the resistance of HIV-1 isolates to bNAbs. In this study, we present a novel machine learning approach called BRAVE (Bnab Resistance Analysis Via Evolutionary scale modeling 2) designed to predict HIV-1 resistance against 33 known bNAbs. This innovative tool employs a Random Forests classifier that uses a protein language model to reliably capture protein features. Results: BRAVE outperformed leading resistance prediction tools on various performance metrics, attaining the highest performance in established classification measures including accuracy, area under the curve, logarithmic loss, and F1-score. Importantly, rigorous statistical comparisons (p<0.001) show that BRAVE is significantly more accurate than state-of-the-art neutralization prediction tools. BRAVE will facilitate informed decisions of antibody usage and sequence-based monitoring of viral escape in clinical settings. Availability and implementation: BRAVE software is available for download on GitHub (https://github.com/kiryst/BRAVE/tree/master). Contact: reda.rawi@nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.

10

Phenomics Assistant: An Interface for LLM-based Biomedical Knowledge Graph Exploration

O'Neil, S. T.; Schaper, K.; Elsarboukh, G.; Reese, J. T.; Moxon, S. A. T.; Harris, N. L.; Munoz-Torres, M. C.; Robinson, P. N.; Haendel, M. A.; Mungall, C. J.

2024-02-02 bioinformatics 10.1101/2024.01.31.578275 medRxiv

Top 0.1%

26.6%

We introduce Phenomics Assistant, a prototype chat-based interface for querying the Monarch knowledge graph (KG), a comprehensive biomedical database. While unaided large language models (LLMs) are prone to mistakes in factual recall, their strong abilities in summarization and tool use suggest new opportunities to help non-expert users query and interact with complex data, while drawing on the KG to improve reliability of the answers. Leveraging the ability of LLMs to interpret queries in natural language, Phenomics Assistant enables a wide range of users to interactively discover relationships between diseases, genes, and phenotypes. To assess the reliability of our approach and compare the accuracy of different LLMs, we evaluated Phenomics Assistant answers on benchmark tasks for gene-disease association and gene alias queries. While comparisons across tested LLMs revealed differences in their ability to interpret KG-provided information, we found that even basic KG access markedly boosts the reliability of standalone LLMs. By enabling users to pose queries in natural language and summarizing results in familiar terms, Phenomics Assistant represents a new approach for navigating the Monarch KG.

11

An Interactive AI Agent for Adaptive Modeling of RNA Region-Ligand Interactions via LLM-Generated Machine Learning Workflows

Liu, Z.; Li, Y.; Bao, Y.; Zhou, S.; Dong, Z.; Wang, W.; Jin, H.; Yang, H.; Kuang, Z.; Lin, G. N.; Wang, Z.; Wang, H.

2025-09-17 bioinformatics 10.1101/2025.09.11.675747 medRxiv

Top 0.1%

26.6%

Precise modeling of RNA-ligand interactions is essential for understanding RNA functionality and designing RNA-targeted therapeutics. Current computational approaches largely focus on predicting discrete binding sites, limiting their applicability to complex RNA regions that may harbor multiple or diffuse ligand binding motifs. Here, we present RLAgent, an interactive agent framework designed to predict ligand interactions at the RNA region level, enabling higher-resolution and more flexible modeling than conventional site-centric approaches. RLAgent reframes the RNA-ligand prediction workflow as a dialogue-driven process. Through a natural language interface, users can interactively configure modeling preferences without writing code. A locally hosted large language model (LLM) acts as the core orchestration agent, automating all key components of the modeling pipeline, including data validation, feature encoding, model training, evaluation, and visualization. This agent-based design lowers technical barriers and enhances reproducibility, making RNA-ligand prediction more accessible for both computational and experimental researchers.

12

MkAtt-SDN2GO: Multi-kernel Attentive-SDN2GO Network for Protein Function Prediction in Humans

Jhawar, K.; Bhatt, T.; Sunil, R.; Lipo, W.

2025-11-04 bioinformatics 10.1101/2025.11.02.686093 medRxiv

Top 0.1%

26.5%

Accurately annotating the functions of uncharacterised human proteins remains a major bottleneck in biology. We present MkAtt-SDN2GO, a neural architecture that extends SDN2GO by integrating adaptive multi-kernel convolution and attention mechanisms to predict Gene Ontology terms from protein sequences, domains, and protein-protein interaction (PPI) context. The sequence stream employs a learnable multi-kernel convolution layer that combines features from multiple kernel sizes through attention-based gating, enabling adaptive motif detection without relying on a fixed receptive field. A self-attention layer models long-range dependencies, while cross-attention integrates sequence, domain, and PPI representations into a unified prediction space. On a CAFA-style benchmark, MkAtt-SDN2GO improves Molecular Function (MF) Fmax by 14.8% (0.657 vs 0.572) and Recallmax by 18.8% over SDN2GO. Across Homo sapiens, the fused model achieves top Fmax scores in Biological Process (BP) (0.441), MF (0.657), and Cellular Component (CC) (0.522) compared with other methods. Although the domain-only stream performs strongly, the cross-attention fusion enhances robustness and interpretability when individual modalities are weak or missing. Overall, adaptive multi-scale convolution combined with attention advances large-scale protein annotation and offers a scalable, potentially useful tool for functional genomics and disease research.
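
For readers unfamiliar with the Fmax metric quoted above, the sketch below shows one common way it is computed. It assumes the usual CAFA-style protein-centric definition (precision averaged over proteins with at least one prediction at the threshold, recall averaged over all annotated proteins, and the best F-score taken over thresholds); the preprint's exact evaluation protocol is not given here, and the function name `fmax` and the toy GO identifiers are hypothetical.

```python
def fmax(predictions, truths, thresholds=None):
    """Protein-centric Fmax.

    predictions: {protein: {term: score in [0, 1]}}
    truths:      {protein: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truths.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= tau}
            true_positives = len(predicted & true_terms)
            # Precision only counts proteins with at least one prediction.
            if predicted:
                precisions.append(true_positives / len(predicted))
            recalls.append(true_positives / len(true_terms))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example (hypothetical terms and scores).
truths = {"P1": {"GO:1", "GO:2"}, "P2": {"GO:3"}}
good = {"P1": {"GO:1": 0.9, "GO:2": 0.4, "GO:9": 0.2}, "P2": {"GO:3": 0.8}}
bad = {"P1": {"GO:9": 0.9}, "P2": {"GO:8": 0.7}}
score_good = fmax(good, truths)
score_bad = fmax(bad, truths)
```

Because the maximum is taken over thresholds, a model is rewarded if any single cut-off separates true from false terms well, which is why the toy `good` predictions reach a perfect score despite one low-scoring false term.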

13

IAN: An Intelligent System for Omics Data Analysis and Discovery

Nagarajan, V.; Shi, G.; Horai, R.; Yu, C.-R.; Gopalakrishnan, J.; Yadav, M.; Liew, M.; Gentilucci, C.; Caspi, R. R.

2025-03-11 bioinformatics 10.1101/2025.03.06.640921 medRxiv

Top 0.1%

26.3%

IAN is an R package that addresses the challenge of integrating, analyzing and interpreting high-throughput "omics" data, using a multi-agent artificial intelligence (AI) system. IAN leverages popular pathway and regulatory datasets (KEGG, WikiPathways, Reactome, GO, ChEA) and the STRING database for protein-protein interactions to perform standard enrichment analysis. The individual enrichment results are then used to generate insightful summaries for each of the datasets, using a large language model (LLM) through a multi-agent architecture. These summaries are then contextually integrated and interpreted by the LLM, guided by carefully engineered prompts and grounding instructions, to provide insightful explanations, system overview, key regulators, novel observations, etc. We demonstrate IAN's potential to facilitate biological discovery from complex omics data by reanalyzing two previously published datasets and evaluating the results. We also show remarkable performance of IAN in terms of avoiding hallucination. The IAN package, along with installation instructions and example usage, is available at https://github.com/NIH-NEI/IAN.

14

GeneWhisperer: Enhancing manual genome annotation with large language models

Li, X.; Whan, A. P.; McNeil, M.; Andrew, S. C.; Dai, X.; Fechner, M.; Paris, C.; Suchecki, R.

2025-04-01 bioinformatics 10.1101/2025.03.30.646211 medRxiv

Top 0.1%

23.6%

Genome annotation is critical for understanding functional elements within genomes. Manual curation is a common practice for identifying the functions of genes, particularly those missed by automated annotation pipelines. However, this process is notoriously labour-intensive and time-consuming. In response to these challenges, we present GeneWhisperer, an innovative assistant system designed to facilitate the manual gene functional annotation process. Utilizing a large language model (LLM) agent, GeneWhisperer provides users access to tools appropriate for specific curation tasks in genome annotation. Featuring a conversational interface, GeneWhisperer fosters effective collaboration between human experts and AI. By synergizing the capabilities of AI with human expertise, GeneWhisperer offers a promising approach to advancing our understanding of gene functions and streamlining the gene annotation process.

15

BioML-bench: Evaluation of AI Agents for End-to-End Biomedical ML

Miller, H. E.; Greenig, M.; Tenmann, B.; Wang, B.

2025-09-04 bioinformatics 10.1101/2025.09.01.673319 medRxiv

Top 0.1%

23.2%

Large language model (LLM) agents hold promise for accelerating biomedical research and development (R&D). Several biomedical agents have recently been proposed, but their evaluation has largely been restricted to question answering (e.g., LAB-Bench) or narrow bioinformatics tasks. Presently, there remains a lack of benchmarks evaluating agent capability in multi-step data analysis workflows or in solving the machine learning (ML) challenges central to AI-driven therapeutics development, such as perturbation response modeling or drug toxicity prediction. We introduce BioML-bench, the first benchmarking suite for evaluating AI agents on end-to-end biomedical ML tasks. BioML-bench spans four domains (protein engineering, single-cell omics, biomedical imaging, and drug discovery) with tasks that require agents to parse a task description, build a pipeline, implement models, and submit predictions graded by established metrics (e.g., AUROC, Spearman). We evaluate four open-source agents: two biomedical specialists (STELLA, Biomni) and two generalists (AIDE, MLAgentBench). On average, agents underperform relative to human baselines, and biomedical specialization does not confer a consistent advantage. We also found that agents which employed more diverse ML strategies more often tended to score highest, suggesting that architecture and scaffolding may be stronger determinants of performance. These findings underscore both the potential and current limits of agentic systems for biomedical ML, and highlight the need for systematic, reproducible evaluations. BioML-bench is provided open-source at github.com/science-machine/biomlbench.

16

PROTEINATOR: Web-UI exploring repurposing hypotheses of PROTEIN InhibiTORs based on protein interactions

Tangadu, S.; Shankar, S.; Varanasi, B. V.; Athri, P.

2019-06-11 bioinformatics 10.1101/667329 medRxiv

Top 0.1%

23.2%

PROTEINATOR is the first version of a staggered, multi-paradigm and extensible drug repurposing platform, focusing on a novel data analytic and integration strategy to find repurposing candidates that have potential to modulate targets through protein-protein interactions. The UI was created as an explorer to find indirect drugs for a protein of interest. PROTEINATOR is developed as a web application that lets researchers search for alternate drugs for a protein of interest, based on the protein's direct interaction with another druggable protein. This unique tool provides researchers exploring specific implicated protein(s) (in the context of drug development) with alternate, plausible routes to modulation by listing proteins that interact with the protein of interest and have reported inhibitors. It is a search engine that identifies indirect drugs by connecting various databases, thus avoiding multiple steps and manual errors. Using a representative set of databases, 112,083 indirect drug interactions that are potential modulators of proteins are discovered, detailed annotations of which are provided in the UI. PROTEINATOR is freely available at http://www.proteinator.in.

17

LazyPair: scalable prediction of protein-protein interactions and interaction types

Lim, C. S.; Bhandari, B. K.; Gardner, P. P.

2022-02-22 bioinformatics 10.1101/2022.02.21.481370 medRxiv

Top 0.1%

23.0%

Motivation: Almost all cellular processes require protein-protein interactions. Common interaction types include binding, post-translational modifications, and catalysis. However, existing prediction tools do not take these interaction types into account and do not scale well on proteome-wide prediction. Results: Here we show that a random forest classifier trained on per-residue physicochemical and biochemical properties is useful for predicting protein-protein interactions. Counterintuitively, we find that training random forests by individual interaction types improves accuracy. Furthermore, a combination of these specialised classifiers improves generalisability. We call our protein-protein interaction prediction tool LazyPair. More importantly, LazyPair outperforms the state-of-the-art in accuracy, generalisability and scalability. Availability and implementation: LazyPair and the source code and data for reproducing our analysis are freely available at https://github.com/Gardner-BinfLab/PPI_Analysis_2022 and https://doi.org/10.5281/zenodo.6071630. The web server version and the source code are freely available at https://tisigner.com/lazypair/ and https://github.com/Gardner-BinfLab/TISIGNER-ReactJS, respectively.

18

Lupus Nephritis Subtype Classification With Only Slide Level Labels

Sharma, A.; Chauhan, E.; Uppin, M. S.; Liza, R.; Jawahar, C. V.; Vinod, P. K.

2023-12-04 nephrology 10.1101/2023.12.03.23299357 medRxiv

Top 0.1%

22.6%

Lupus Nephritis classification has historically relied on labor-intensive and meticulous glomerular-level labeling of renal structures in whole slide images (WSIs). However, this approach presents a formidable challenge due to its tedious and resource-intensive nature, limiting its scalability and practicality in clinical settings. In response to this challenge, our work introduces a novel methodology that utilizes only slide-level labels, eliminating the need for granular glomerular-level labeling. A comprehensive multi-stained lupus nephritis digital histopathology WSI dataset was created from the Indian population, which is the largest of its kind. LupusNet, a deep learning MIL-based model, was developed for the sub-type classification of LN. The results underscore its effectiveness, achieving an AUC score of 91.0%, an F1-score of 77.3%, and an accuracy of 81.1% on our dataset in distinguishing membranous and diffuse classes of LN.

19

KODA: Agentic Framework for Microbiome Drug Target Discovery

Aminian-Dehkordi, J.; Parsa, M. S.; Naghipourfar, M.; Mofrad, M.

2025-06-01 bioinformatics 10.1101/2025.05.27.656480 medRxiv

Top 0.1%

22.4%

The gut microbiome plays a crucial role in human health and disease, influencing diverse biological processes such as immune regulation and nutrient metabolism. However, the complexity of microbial interactions and their metabolic cross-feeding dynamics remains poorly understood. This study proposes KODA, an agentic framework that integrates large language models (LLMs) and knowledge graphs (KGs) to facilitate the discovery of antimicrobial drug targets in the gut microbiome. Our approach employs a multi-agent system to interpret natural language queries and translate them into precise graph database queries, enabling intuitive interactions with complex microbiome data. Focusing on KEGG orthologies related to essential microbial genes, KODA identifies potential antimicrobial drug targets by analyzing microbial metabolic pathways. The system employs a Neo4j-based microbiome KG, which integrates microbial interaction data, metabolic models, and KEGG annotations. A dedicated evaluation framework, which incorporates LLM-based reviewers, assesses the quality of generated queries and analytical reports. Our results demonstrate the efficacy of KODA in providing actionable insights for antimicrobial research, particularly in identifying conserved essential genes as potential drug targets. This framework holds the potential to democratize microbiome research by lowering technical barriers and accelerating hypothesis generation in drug discovery.

20

ESM-Effect: An Effective and Efficient Fine-Tuning Framework towards accurate prediction of Mutation's Functional Effect

Glaser, M.; Braegelmann, J.

2025-02-07 bioinformatics Community evaluation 10.1101/2025.02.03.635741 medRxiv

Top 0.1%

22.3%

Predicting functional properties of mutations like the change in enzyme activity remains challenging and is not well captured by traditional pathogenicity prediction. Yet such functional predictions are crucial in areas like targeted cancer therapy where some drugs may only be administered if a mutation causes an increase in enzyme activity. Current approaches either leverage static Protein-Language Model (PLM) embeddings or complex multi-modal features (e.g., static PLM embeddings, structure, and evolutionary data) and either (1) fall short in accuracy or (2) involve complex data processing and pre-training. Standardized datasets and metrics for robust benchmarking would benefit model development but do not yet exist for functional effect prediction. To address these challenges we develop ESM-Effect, an optimized PLM-based functional effect prediction framework through extensive ablation studies. ESM-Effect fine-tunes ESM2 PLM with an inductive bias regression head to achieve state-of-the-art performance. It surpasses the multi-modal state-of-the-art method PreMode, indicating redundancy of structural and evolutionary features, while training 6.7 times faster. In addition, we develop a benchmarking framework with robust test datasets and strategies, and propose a novel metric for prediction accuracy termed relative Bin-Mean Error (rBME): rBME emphasizes prediction accuracy in challenging, non-clustered, and rare gain-of-function regions and correlates more intuitively with model performance than the commonly used Spearman's rho. Finally, we demonstrate partial generalization of ESM-Effect to unseen mutational regions within the same protein, illustrating its potential in precision medicine applications. Extending this generalization across different proteins remains a promising direction for future research. ESM-Effect is available at: https://github.com/moritzgls/ESM-Effect.